Development and evaluation of ActSeq: A targeted next-generation sequencing panel for clinical oncology use

Purpose The demand for high-throughput genetic profiling of somatic mutations in cancer tissues is growing. We sought to establish a targeted next generation sequencing (NGS) panel test for clinical oncology practice. Methods Customized probes were designed to capture exonic regions of 141 genes selected for the panel, which was aimed for the detection of clinically actionable genetic variations in cancer, including KRAS, NRAS, BRAF, ALK, ROS1, KIT and EGFR. The size of entire targeted regions is 0.8 Mb. Library preparation used NEBNext Ultra II FS kit coupled with target enrichment. Paired-end sequencing was run on Illumina NextSeq 500 at a read length of 150 nt. A bioinformatics workflow focusing on single nucleotide variant and short insertions and deletions (SNV/indel) discovery was established using open source, in-house and commercial software tools. Standard reference DNA samples were used in testing the sensitivity and precision and limit of detection in variant calling. Results The general performance of the panel was observed in pilot runs. Average total reads per sample ranged from 30 million to 48 million, 73% ~82% unique reads. All runs had more than 99% average mapping rate. Mean target coverage ranged from 727x to 879x. Depth of coverage at 50x or more reached 87% of targeted region and 60% of targeted region received 500x or more coverage depth. Using OncoSpan HD827 DNA, which bears 144 variants (SNV/indel) from 80 genes that are within the targeted region on the panel, our somatic variant calling pipeline reached 97% sensitivity and 100% precision respectively, with near 48 million reads. High concordance with orthogonal approaches in variant detection was further verified with 7 cancer cell lines and 45 clinical specimens. Conclusion We developed a NGS panel with a focus on clinically actionable gene mutations and validated the performance in library construction, sequencing and variant calling. High concordance with reference materials and orthogonal mutation detection was observed.


Introduction
Next generation sequencing (NGS) technologies are transforming the practice of many areas in clinical molecular medicine. Identifying key oncogenic drivers and actionable genetic alterations in a high throughput manner has been increasingly adopted in current clinical oncology practice, benefiting the diagnostic, prognostic and therapeutic needs of many cancer patients. For example, National Comprehensive Cancer Network (NCCN) guidelines (version 2.2018) recommend KRAS, NRAS and BRAF mutation detection in metastatic colorectal carcinoma (CRC) for the well-established role in predicting therapeutic response [1]. CRC with mutations in exons 2-4 of KRAS or NRAS genes are not eligible for EGFR antibody therapies [2]. The College of American Pathologists and the Association of Molecular Pathology guidelines recommend EGFR, ALK and ROS1 gene tests as mandatory in non-small cell lung carcinoma (NSCLC) [3].
The pressing clinical need to characterize the genetic profile of individual tumors has led to a surge in genome-wide tests [4,5], and recently clinical whole genome sequencing and whole exome sequencing have been in use in some academic tertiary cancer centers [6]. However, for cost-effectiveness and operational feasibility, targeted panel-based NGS assays, such as those represented by FoundationOne [7] and MSK-IMPACT [8] are still the mainstream applications in oncology practice.
Clinical NGS panel tests vary considerably in design and implementation across different laboratories. Some chose to validate pre-designed panels from vendors and others customized their own panels. To newcomers in the field, it remains a challenge to establish a targeted panel test from scratch since a significant investment of operational and bioinformatics infrastructure is required. But the benefits are also obvious, as newly identified actionable target regions can be readily added to the panel to meet ever-evolving clinical needs. Considering the fact that approximately 80% of cancer patients in the USA are treated locally at community hospitals [9], where resources are not readily available with regard to clinical NGS panel testing and the related training or expertise to interpret these results, it is meaningful to experiment and expand the capability of customizing a clinical NGS panel test.
Herein we describe our experience in developing a NGS-based comprehensive panel that includes actionable small variants (SNVs and indels) in common cancer types, examining and refining its performance for a potential implementation in a community pathology laboratory that would therefore benefit the practice of oncology in the region. We developed a hybridization capture-based NGS panel test for cancer mutation, designated ActSeq, with the focus on SNV/indel variants of 141 cancer genes in solid tumors and leukemia, and evaluated the detection performance in FFPE specimens using the NextSeq 500 (Illumina, CA) platform.

ActSeq panel design
For best clinical utility, the selection of genes was primarily based on those with FDA approved targeted therapies [10], OncoKB actionable genes [11], the molecular test menu of our local clinical pathology laboratory and recommendations from local pathologists and oncologists. The panel was aimed to provide coverage of a total of 141 cancer-related genes (S1 Table: Gene list). All the current actionable genes, which bear mutations with known targeted therapy or otherwise guide clinical therapy, are included. The size of entire targeted space is about 0.8Mb. Probes for target capture were custom-designed and manufactured with MYBaits technology from Arbor Bioscience (Ann Arbor, MI). MYBait probe/bait is 80-mer in length and allows more than 5% difference between target and probes/baits. The ActSeq panel used 19240 probes (illustrated in S1 Fig and sequences provided in S2 Table) at 2x tiling for targeted regions on 141 genes (3044 exons), with a total bait territory of about 1.0Mb. Considering the targeted space, bait design efficiency is 0.87.

Tissue, cell and DNA samples
A standardized reference cell line NA12878 from Genome-In-A-Bottle (GIAB) project was purchased from Coriell Cell Repositories (Camden, NJ, USA). The truth set for variants in the NA12878 cell line was obtained via the website precision.FDA.gov. OncoSpan DNA HD827 was purchased from Horizon Discovery (Cambridge, UK). Seven ovarian and breast cancer cell lines, SKOV3, OVCAR5, OVCAR8, MDAMB231, MDAMB468, HCC38 and HCC1806, were all originally obtained from ATCC (Manassas, VA) and propagated in our research lab. Information on the genetic variants (SNV/indels) within these cell lines was obtained via the website of the Cancer Cell Line Encyclopedia (CCLE) project and manually curated (https:// sites.broadinstitute.org/ccle/).
All human cancer tissue samples were retrospectively obtained from South Bend Medical Foundation, South Bend, IN with a waiver of informed consent approved by the Institutional Review Board (IRB exempt protocol #17-11-4231). All cases were diagnostically confirmed by consultant pathologists. DNA preparation from fresh tissue or cells was performed as previously described [12]. For clinical formalin-fixed paraffin-embedded (FFPE) tumor samples, the QIAmp DNA FFFE Tissue kit was used. Selected tumor tissue blocks (tissue > = 0.5 cm in size, at least 20% tumor cells in circled area) were sectioned 8 μm thick (5 sections per case). With H&E tissue slide where regions with tumor were marked by pathologists as a reference, non-tumor was removed by dissection. Sections were processed as follows: remove paraffin in xylene; lyse under denaturing conditions with proteinase K, reverse formalin crosslinking by incubation at 90˚C; bind DNA to the membrane and wash away contaminants, and finally elute the DNA with Tris-EDTA buffer.

Mutation detection with OncoFOCUS MassArray panel
Agena OncoFOCUS Panel v3.0 is based on the MassArray System (Agena Bioscience, San Diego, CA, USA) for the detection and quantification of 230 driver mutations in BRAF, EGFR, KRAS, NRAS, and KIT. It employs matrix-assisted laser desorption/ionization time-of-flight mass spectrometry for amplicon detection and differentiation [13]. Primers were pre-designed and provided by the manufacturer for PCR (polymerase chain reaction) amplification of regions with specific mutations. PCR reactions contained Taq DNA polymerase, genomic DNA (5~10 ng), PCR primers, and dNTP. Following PCR (45 cycles), the remaining dNTPs were removed by the addition of shrimp alkaline phosphatase (SAP), after which the plates were incubated at 37˚C for 40 min. Following the PCR reaction, SAP addition, and extension reaction, the samples were desalted by resin treatment for 15 min, then spotted onto Spectro-CHIP 1 Arrays and analyzed by mass spectrometry. SpectroTYPER v4.0 software (Agena Bioscience, San Diego, CA) was used for the ultimate interpretation of the results.

Library preparation and target capture
NEBNext Ultra II FS DNA library prep kit (NEB Lab, MA, USA) was used for NGS library preparation. All starting DNA samples were quantified with Qubit 2.0 Flurometer (Thermo Fisher, Waltham, MA) and starting DNA amount was 100ng for a standard test (range of 20~200ng). Fragmentation of DNA to 200-450 bp size was carried out at 37˚C for 10 min with 2 μl NEBNext Ultra II FS enzyme mix, 7 μl reaction buffer in a total volume of 35 μl with DNA, followed by end preparation (30 minutes at 65˚C), then proceeded to adaptor ligation, which was carried out at 20˚C for 15 minutes after the addition of 2.5 μl NEBNext adapter for Illumina, 1.0 μl NEBNext Ultra II Ligation enhancer and 30.0 μl NEBNext Ultra II ligation master mix. For all library preparations starting with 100ng or more DNA, size selection with 2 rounds of NEBNext Sample Purification Beads was performed. All adapter-ligated DNA preps were PCR amplified (6 cycles) for enrichment and cleaned up with one round of Sample Purification Beads. Finally, libraries were checked with Bioanalyzer (Agilent, Santa Clara, CA) for size quality control and quantified with Qubit 2.0 Flurometer.
To enrich the target DNA specific to the regions that the ActSeq panel was designed to detect, customized biotinylated RNA baits MYBaits (Arbor Bioscience, Ann Arbor, MI) selection was applied. Briefly, every four properly prepared and indexed DNA libraries were pooled and concentrated with AMPure XP beads (Beckman Coulter, Indianapolis, IN) as one sample. For each capture reaction, per user manual version 3.01, 5.0μl Blockers mix and 7μl of pooled library input were used for a 30 μl total reaction volume. It was then heat-denatured in the presence of adapter-specific blocking oligonucleotides, which were used to bind to library adapters before biotinylated RNA baits were introduced for hybridization at 65˚C for 24 hr with lid at 105˚C. Streptavidin-coated magnetic beads (30μl for each capture reaction, Dynabeads 1 MyOne™ Streptavidin C1 magnetic beads from Invitrogen, #650-01) were used to pull out bait-target hybrids. Beads were stringently washed 4 times and finally the captured DNA library was eluted from beads, followed by PCR amplification (11 cycles), then cleaned up with AMPsure XP beads. The final library pool was quantified by KAPA library qPCR quantification Kit (Roche Sequencing, Pleasanton, CA). All sequencing was run on Illumina NextSeq 500, with paired end reading length of 150 nt (NextSeq 500/550 Mid Output v2 Kit, 300 cycles).

Bioinformatics pipelines and data analysis
NGS data processing was done on the high-performance computing cluster at the University of Notre Dame Center for Research Computing using a pipeline developed with open-source software tools. A schematic overview of the major steps is provided in S2 Fig. Raw reads in demultiplexed FASTQ files were run by FastQC (v0.11.8), and then aggregated with MultiQC (v1.8) for manual read quality inspection. All read alignments were performed with BWA--MEM (v0.7.17) and human genome reference hg19 as provided by GATK bundle (https:// gatk.broadinstitute.org/). SAMtools (1.9), GATK (v4.2.0.0, including Mutect2) and ANNO-VAR (v20210123) were used in quality control metrics, deduplication, base quality recalibration, local realignment, variant calling and variant annotation. Variant calls were limited to predefined target regions and without paired normal control. The removal of germline polymorphism was achieved by filtering against population databases, including dbSNP(Common 151) and gnomAD(v2.1), and all variants with minor allele frequency >1% in the databases were excluded. VarSeq (v2.0.2) from Golden Helix, Inc. (Bozeman, MT) was used for variant filtration and annotation of clinical samples. In particular, variants that met any one of the following, VAF (variant allele frequency) <0.05, read depth < 50x or variant supporting reads <5 were removed. Manual review was used for variant validity confirmation. Integrative Genomics Viewer (IGV, v2.11.0) was used for manual data inspection and visualization. Raw data from all cell line samples are available from SRA (BioProject PRJNA803819) and the bioinformatics pipeline (snakemake v6.14.0) is available on GitHub (https://github.com/ harpernd/actseq, release v1.0.0). Raw data from patient samples are not publicly sharable per IRB and dbGaP repository policy.
RTGtools vcfeval from Realtime Genomics (Hamilton, New Zealand) was used to evaluate VCF (Variant Call Format) outputs from variant calling in comparison with known truth set VCF file from the DNA sample supplier. Performance metrics include sensitivity, precision and F-score. For sensitivity estimation, based on the truth set, variants were classified as true positive (TP) or false negative (FN) if not detected. Sensitivity was expressed as TP/(TP+FN). Precision estimation was based on variants call true positive (TP) or false positive (FP) if not in truth set and it was calculated as TP/(TP+FP), this is also known as positive predictive value (PPV). F1 score was calculated as 2TP/(2TP+FP+FN), which is the harmonic mean of precision and sensitivity. For statistical significance of group comparisons, Welch's t-test was used and significance level set at α<0.01.

Characteristics of ActSeq next generation sequencing panel
Sequencing reactions with the ActSeq panel were run in 10 batches, yielding a mean of 39 million reads per sample (range 23~62 million) with base quality above Phred score 30. In Table 1, basic characteristics were extracted from the first 3 batches of samples, 8 samples per run, and reflected the general performance of this panel design. Total reads per sample range from 30 million to 48 million, 73%~82% of reads are unique (i. e. not duplicates), off-bait reads are 53%~58%. All samples had more than 99% mapping rate. Less than 4% of targeted regions were not covered. The mean target coverage ranges from 727x to 879x. As shown in Fig 1A, a depth of coverage at 50x or more was reached for 87% of targeted regions and 60% of targeted regions received 500x or more coverage depth. The coverage on exonic regions of individual genes is illustrated in Fig 1B. Due to the dominant use of FFPE material in pathology labs, we compared the metrics of library preparation and sequencing of DNA samples from fresh frozen material vs those from FFPE samples (Fig 2A). When using a starting DNA amount of 100ng and a bead-based size  selection method, no major difference was discovered between the fresh group and FFPE group (4 samples in each group) in terms of total reads, unique reads, or depth of coverage on targeted regions. Using DNA from FFPE materials, we compared the impact of different starting amount of DNA on panel performance. Based on the recommendation from NEBNext Ultra II FS DNA library prep kit manual, when the DNA amount is less than 100ng, standard library size selection was not applied, instead one round of the clean-up procedure was used. Interestingly, it was observed that using DNA samples from the same 4 patients, the group with 50ng of starting DNA (without size selection) exhibited a higher yield of reads, higher number of unique reads and higher depth of coverage on targets than those starting with 100ng DNA as show in Fig 2B  (although not statistically significant). Even lower amounts of starting DNA amount are also feasible. The lowest amount of DNA we used was 13ng (Sample H21023), which still produced a total of 44 million reads, 15 million unique reads and reached a mean target coverage at 52x. The known variant in this sample was clearly identified by our ActSeq NGS panel test.

Performance in variant detection with reference materials
The variant calling pipeline for the ActSeq panel was established following GATK best practice [14] and the workflow is illustrated in S2 Fig. Although the ActSeq panel is meant for somatic mutation detection, to gain familiarity with the variant calling and filtering process we purchased DNA from NA12878 cells. This sample is from a transformed B-Lymphocyte cell line that has been extensively tested with NGS methods and is now often used as reference sample to NGS variant calling workflows, a "genome-in-a-bottle" as promoted by National Institute of Standards and Technology (NIST) [15]. This sample was run through the ActSeq panel using a separate germline variant calling pipeline made with GATK HaplotypeCaller. High sensitivity, precision (also known as positive prediction value, PPV) and F1 score were obtained for NA12878 germline variant calls as shown in the first row of Table 2.
For somatic variant calling, we used OncoSpan HD827 DNA sample for performance evaluation. According to the supplier, OncoSpan is a well-characterized, cell line-derived reference standard containing 386 variants across 152 key cancer genes. Among them, 80 genes (144 variants) are within the targeted region on the ActSeq panel. Our MuTect2 based somatic variant calling pipeline could reach 97% sensitivity and 100% precision respectively with near 48 million reads. The system is robust enough that when the number of reads was reduced to half of the original, the same sensitivity and precision in variant detection was reached ( Table 2).
Taking advantage of the Cancer Cell Line Encyclopedia (CCLE) project that has characterized the commonly used human cancer cell lines with whole genome sequencing and/or whole exome sequencing, ActSeq was run with DNA samples extracted from 7 cell lines from our research lab (Table 3). Out of the 42 variants that are within the targeted region of ActSeq, 38 (90%) of them were detected by the ActSeq workflow. The 4 missed variants could be attributed  to the lab-specific propagation of those cell lines, not a technical failure of the variant caller or the workflow, because upon manual review of the BAM files, all the 4 positions were covered with proper number of reads, the least for them is on SF3B1 gene p.Q534 in HCC38 cells, but still at 462x (last column in Table 3). The allele frequencies of the 4 variants from CCLE data are 0.21~0.23, indicating they are not the dominant clone in the samples they were identified.

Concordance with orthogonal approach in variant detection on clinical specimens
The MassARRAY-based OncoFOCUS panel (v3) was previously adopted in our community pathology lab as a fully validated clinical test of common mutations in BRAF, EGFR, KIT, KRAS, and NRAS genes, which are implicated in various cancers, particularly colon cancer, lung cancer and melanoma. Samples from a cohort of 45 formalin-fixed paraffin-embedded materials with known positive results from OncoFOCUS were tested with the ActSeq NGS panel for technical validation. The composition of cancer types is illustrated in Fig 3A. All the known variants detected by OncoFOCUS were discovered by the ActSeq NGS panel, and with comparable variant allele frequency (VAF) as listed in Table 4. Distribution of VAF of all valid non-synonymous variants detected by ActSeq panel is plotted in Fig 3B. More details on the variants are listed in S3 Table. Higher yield in variant detection from NGS panel sequencing is obvious as in Fig 3C, which illustrates the top 20 genes with SNV/indel variants in the 45 patient specimens. KRAS and TP53 are unsurprisingly the top 2 genes with the highest number of variants in this cohort. With the OncoFOCUS panel, only 1 variant was found from each patient. With the ActSeq NGS panel, a total of 276 variants from 67 genes were discovered, at least 2 variants per patient (range of 2-19 variants per patient). The variant types are illustrated in Fig 3D, 62% of them are missense mutations, 28% are indels (in-frame or frameshift) and 9% are stop-gains.

Discussion
The completion of The Cancer Genome Atlas (TCGA) project consolidated our understanding of the major oncogenic signaling pathways, providing a more complete picture of drivers in oncogenesis [16]. The National Cancer Institute Molecular Analysis for Therapy Choice (NCI-MATCH) trial demonstrated the feasibility and efficacy of using NGS to triage patients to investigational therapy [17]. The use of NGS panels in general clinical oncology practice is rapidly increasing as well. Often, the limited amount of available tissue encourages a highthroughput approach, particularly in non-small cell lung cancer, wherein the availability of specimens after initial pathological diagnosis is often restricted. A targeted clinical NGS assay Note: AD, number of variant supporting reads in ActSeq; DP, total depth of reads in ActSeq.

PLOS ONE
Clinical cancer NGS panel development ActSeq is a customized NGS panel that aims to cover clinically actionable genes and is an intended upgrade from the mass spectrometry-based OncoFOCUS panel that was previously adopted. It is an expansion from 5 genes with specified variants to all exons of 141 genes. The choice of sequencing strategy for a clinical test has important ramifications on the variant calling process, as clearly there are advantages of hybridization over amplicon based approaches in terms of target enrichment [18]. On the other hand, however, hybridization capture methodology is prone to off-target enrichment. With MYBAIT used in the ActSeq panel, the short length of probes makes it more vulnerable, particularly when the probes bear similarity with non-coding sequences. Indeed, a high percentage off-bait reads, which ranged from 53%~58% in the first three runs, was observed. However with sufficient coverage of the targeted regions (727x~879x), this does not negatively affect downstream variant calling. It is also possible to reduce off-bait reads via improving wet-lab procedures in future iterations of this assay.
Depth and breadth of sequencing coverage on targeted regions are critical factors in NGS panel performance and have a deep impact on the variant calling. A typical average read depth in a NGS panel assay is 100~500x [19]. The ActSeq panel has a mean target coverage over 700x. When tested with NA12878, our panel obtained high sensitivity (98%) and high PPV (98%), which not only verified the success of the bioinformatics workflow for variant calling but also indicated the proper coverage in the targeted regions. The higher depth and high percentage of targeted region achieved in this panel ensures that it is possible to detect somatic variants at low allele frequency. This is particularly important, since in cancer tissue the presence of non-cancerous components is inevitable. In a real-world scenario, tissues with low tumor purity or sub-clonal events due to tumor heterogeneity would be encountered, a significant portion of clinically actionable variants have low allele frequencies [20]. Therefore, higher coverage is always preferred whenever possible.
Clinical tumor specimens are often in the form of FFPE materials, wherein DNA degradation and modification may impact library preparation and variant detection. However it has already been found that with higher coverage, reliable mutation calling is possible [21,22]. In current practice, most NGS panels in clinical oncology use FFPE materials. We tested the performance of FFPE vs fresh-frozen tissues in library preparation and sequencing and no difference was found. In addition, to meet the reality of the wide variation in the amount and quality of DNA from clinical specimens, we have established a protocol that can handle as little as 13 ng starting DNA, enabling analysis of lower quality FFPE materials.
Assessing the performance of a bioinformatic workflow, particularly the metrics of variant calling for NGS assays is critical and challenging since ground truth for clinical samples is rarely known and variants cannot be individually validated or quality controlled. However, in recent years standard reference materials with a well-curated variant truth set, including NA12878 cell DNA (Genome-in-a-bottle, for germline use) and OncoSpan HD827 DNA (from Horizon Discovery, for somatic use) used in the current report, have been developed by NIST [15] and commercial entities. These efforts have greatly facilitated the benchmarking of NGS panel development. We obtained high sensitivity and precision in both germline and somatic variant calling. By down sampling, we found that reducing the reads to half the amount does not affect the performance of our Mutect2 based variant calling pipeline, as we detected the same number of variants in the OncoSpan HD827 sample corresponding to our target regions (134 variants) as with the full set of reads. When compared to 45 clinical FFPE samples that were previously tested with an OncoFOCUS panel, a mass spectrometry-based mutational detection approach, we have confirmed that all the known mutations were identified by the ActSeq panel test, and in comparison, an apparent advantage with the NGS panel is higher diagnostic yield.
There are also obvious limitations of the ActSeq panel, including the lack of capability to determine copy number variation (CNV) and structure variation (SV) detection, as these events are commonly involved in oncogenesis. Amplification of certain oncogenes, such ERBB2 and EGFR, are clinically relevant to targeted therapeutics. Many SV events are of diagnostic value, especially those in leukemia and soft tissue sarcoma. Microsatellite instability index (MSI) and tumor mutational burden (TMB) are also not addressed in this panel. All of these features, plus additional genes or variants, could be added in future continuous development of the panel. The aforementioned high percentage of off-bait reads of the panel in this current form means there is room for improvement in cost reduction. However, since many different cancer genes could be included in a given panel, NGS allows consolidation of the clinical laboratory workflow, leading to further efficiency and savings. In the long run, the wide adoption of NGS in healthcare would be beneficial and cost-effective [23].
It should be noted that while the NGS related bioinformatics tools and workflow construction have matured over recent years, variant filtering, annotation, prioritization and clinical interpretation remain to be daunting tasks for clinical labs newly starting NGS assays. To extract those variants with tangible clinical value and biological significance to a meaningful report ready for sign-out requires more dedicated tools and resources customized to the actual needs of the clinical lab.
In summary, we developed a NGS panel with a focus on clinically actionable mutations and validated the performance in library construction, sequencing and variant calling. Full concordance with an orthogonal mutation detection approach was observed in 45 clinical specimens. Our results illustrate the feasibility for a panel development in a community pathology lab, suitable for clinical application, improving diagnosis, prognosis and personalized therapeutic decisions.